This is an exciting piece of work. I agree with the authors that developing
computationally tractable methods for hypotheses testing is an important problem
in statistics that has received little attention to date. In this discussion, I would
like to emphasize three points of the paper under discussion that are of
particular interest.

Connection with statistical learning theory

The idea of convexification of the loss function in order to construct
computationally tractable procedures has been widely used in statistical learning
theory [Zhang, 2004]. In this part of the discussion, I would like to share some
thoughts about the similarities of the two approaches.

To this end, let me briefly recall the principle of loss convexification in the
problem of binary classification. One observes $n$ iid pairs
$\{(X_i, Y_i)\}_{i=1,\dots,n}$ drawn from an unknown distribution $P$ on the
product space $\mathcal X\times\mathcal Y$ with $\mathcal Y = \{-1,+1\}$,
and the goal is to design a prediction rule $g : \mathcal X \to \mathcal Y$ with
the smallest possible misclassification error rate

$$R_P(g) = E_P[\mathbb 1(Y \neq g(X))] = E_P[\mathbb 1(-Y g(X) > 0)]. \tag{1}$$


The convexification is achieved in two steps. First, the classification risk is
replaced by the $\phi$-risk

$$A_P(g) = E_P[\phi(-Y g(X))], \tag{2}$$

where $\phi : \mathbb R \to \mathbb R$ is a convex function often referred to as
the convex surrogate loss. Second, the set of “pure” classification rules
$g : \mathcal X \to \mathcal Y$ is extended to “generalized” rules
$h : \mathcal X \to \mathbb R$ with the convention that the predictions
furnished by $h$ and $\mathrm{sgn}(h)$ are the same. The $\phi$-risk is
accordingly extended to all generalized prediction rules:
$A_P(h) = E_P[\phi(-Y h(X))]$. As a consequence of this construction, if
$\mathcal H$ is a convex subset of the set of all measurable functions
from $\mathcal X$ to $\mathbb R$, then the computation of the empirical risk
minimizer (ERM)



$$\hat h_n \in \arg\min_{h\in\mathcal H} \frac1n \sum_{i=1}^n \phi(-Y_i h(X_i)) \tag{3}$$

amounts to solving a convex program. The most common choices for the function
$\phi$ are the hinge loss $\phi(u) = (1+u)_+$, the exponential loss
$\phi(u) = e^u$ and the logistic loss $\phi(u) = \log(1+e^u)$.
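
To make the construction concrete, here is a minimal Python sketch of the three surrogate losses and of the empirical $\phi$-risk for linear rules $h(x) = \langle w, x\rangle$. The linear class, the toy data, the function names and the plain gradient-descent solver are assumptions made for illustration only; they are not taken from the paper under discussion.

```python
import math

def hinge(u):
    """Hinge surrogate phi(u) = (1 + u)_+."""
    return max(0.0, 1.0 + u)

def exponential(u):
    """Exponential surrogate phi(u) = e^u."""
    return math.exp(u)

def logistic(u):
    """Logistic surrogate phi(u) = log(1 + e^u)."""
    return math.log1p(math.exp(u))

def empirical_phi_risk(w, xs, ys, phi):
    """(1/n) * sum_i phi(-y_i * h(x_i)) for the linear rule h(x) = <w, x>."""
    total = 0.0
    for x, y in zip(xs, ys):
        h_x = sum(wj * xj for wj, xj in zip(w, x))
        total += phi(-y * h_x)
    return total / len(xs)

def erm_logistic(xs, ys, steps=500, lr=0.5):
    """Plain gradient descent on the empirical logistic risk (a convex program)."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            u = -y * sum(wj * xj for wj, xj in zip(w, x))
            slope = 1.0 / (1.0 + math.exp(-u))  # derivative of log(1 + e^u)
            for j in range(d):
                grad[j] += slope * (-y * x[j]) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy sample: first coordinate is a constant 1 (intercept).
xs = [(1.0, 2.0), (1.0, 1.5), (1.0, -0.5), (1.0, -1.0), (1.0, -2.5), (1.0, 0.5)]
ys = [1, 1, 1, -1, -1, -1]
w_hat = erm_logistic(xs, ys)
print(w_hat, empirical_phi_risk(w_hat, xs, ys, logistic))
```

The prediction attached to the fitted generalized rule is then the sign of $\langle \hat w, x\rangle$, in line with the convention recalled above.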

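Let me turn now to the problem of testing two hypotheses $\Theta_0$ and
$\Theta_1$ based on $n$ observations $X_1,\dots,X_n$ independently drawn from a
distribution $P_{\theta^*}$ on $\mathcal X$. Let $\Theta = \Theta_0 \cup \Theta_1$
and $s : \Theta \to \{\pm 1\}$ be the function that equals $-1$ on the set
$\Theta_0$ and $+1$ on the set $\Theta_1$. The usual loss of a pure test
$T : \mathcal X^n \to \{\pm 1\}$ associated with a sample
$X^{(n)} = (X_1,\dots,X_n)$ drawn from
$P^{(n)}_{\theta^*} := P_{\theta^*}^{\otimes n}$ is

$$\ell(T(X^{(n)}), \theta^*) = \mathbb 1(T(X^{(n)}) \neq s(\theta^*)) = \mathbb 1(-T(X^{(n)})\, s(\theta^*) > 0).$$

The corresponding risk is
$R_{P_{\theta^*}}(T) = E_{\theta^*}[\ell(T(X^{(n)}), \theta^*)]$ and the worst case
risk is

$$\varepsilon(T) = \sup_{\theta\in\Theta} R_{P_\theta}(T) = \sup_{\theta\in\Theta} P_\theta(T(X^{(n)}) \neq s(\theta)) = \sup_{\theta\in\Theta} E_\theta[\mathbb 1(-T(X^{(n)})\, s(\theta) > 0)]. \tag{4}$$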

Comparing (4) with (1), one can see some clear similarities between the problem
of finding binary predictors $g$ minimizing the misclassification error rate and
that of finding testing procedures $T$ minimizing the worst case error rate
$\varepsilon(T)$. In both problems the decision rules form a nonconvex set and
the performance measure is defined as the expected loss for a nonconvex loss
function (the Heaviside step function). However, there is an important
difference: contrary to (1), the expectation on the right-hand side of (4) does
not admit an empirical counterpart that is easily computable from the sample.
Therefore, even if one applies the aforementioned two steps of convexification,
this does not readily yield a test procedure computable by solving a convex
program (in the spirit of (3)).

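Elaborating on these ideas, one can define the following convexified strategy
for testing the hypothesis $\Theta_0$ against $\Theta_1$. Given a convex subset
$\mathcal H$ of the set of measurable functions from $\mathcal X^n$ to
$\mathbb R$ and a convex loss $\phi : \mathbb R \to \mathbb R$, define

$$\hat h_\phi \in \arg\min_{h\in\mathcal H} \sup_{\theta\in\Theta} G_\phi(h,\theta), \qquad G_\phi(h,\theta) = E_\theta[\phi(-h(X^{(n)})\, s(\theta))]. \tag{5}$$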

In this “saddle-point” formulation, the outer minimization problem has the
attractive property of being convex: it has a convex feasible set and a convex
cost function. Unfortunately, in general, the inner maximization problem is not
concave and there is no particular reason to expect that it can be efficiently
solved for any given $h$ when the dimensionality of $\theta$ is large. To
circumvent this drawback, the authors had the ingenious idea to combine the
following three facts:

• the saddle point of $G(h,\theta)$ coincides with the saddle point of $\log G(h,\theta)$,

• when the model $\{P_\theta : \theta \in \Theta\}$ belongs to an exponential
family, it is natural to choose $\mathcal H$ as the linear span of the sufficient
statistics: $\mathcal H_0 = \mathrm{Span}(S_j : j = 1,\dots,m)$,

• for some statistical models¹ belonging to an exponential family, for every
$h \in \mathcal H_0$, the mapping
$\theta \mapsto \log\big(E_\theta[\exp(-h(X^{(n)}))]\big)$ is concave.

¹ It could be helpful to mention that the concavity property holds for the usual
parameterization and does not hold for the parameterization in terms of the
natural parameters in the sense of exponential families.
This leads to the test procedure

$$\hat h_{\exp} \in \arg\min_{h\in\mathcal H_0} \sup_{\theta\in\Theta} \log G_{\exp}(h,\theta), \qquad G_{\exp}(h,\theta) = E_\theta[\exp(-h(X^{(n)})\, s(\theta))]. \tag{6}$$

The final step of the construction aims at convexifying the feasible set of the
inner maximization problem. In the case when $\Theta = \Theta_0 \cup \Theta_1$
with convex sets $\Theta_0$ and $\Theta_1$, this aim is achieved by replacing
$\sup_{\theta\in\Theta} \log G_{\exp}(h,\theta)$ by the expression
$\sup_{(\theta,\theta')\in\Theta_0\times\Theta_1} \{\log G_{\exp}(h,\theta) + \log G_{\exp}(h,\theta')\}$,
which does not impact the error of testing too much in view of the inequalities

$$\sup_{\theta\in\Theta} \log G_{\exp}(h,\theta) \le \sup_{(\theta,\theta')\in\Theta_0\times\Theta_1} \{\log G_{\exp}(h,\theta) + \log G_{\exp}(h,\theta')\} \le 2 \sup_{\theta\in\Theta} \log G_{\exp}(h,\theta).$$

An important remark to be made here is that, in the case of the exponential loss
$\phi$, taking the logarithm of $G_\phi$ does not break the convexity with
respect to $h$. So, in this notation, the test proposed and studied by the
authors is

$$\tilde h_{\exp} \in \arg\min_{h\in\mathcal H_0} \sup_{(\theta,\theta')\in\Theta_0\times\Theta_1} \{\log G_{\exp}(h,\theta) + \log G_{\exp}(h,\theta')\}. \tag{7}$$
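
The convexity in $h$ of the logarithm of the exponential risk can be checked with Hölder's inequality. The following is only a sketch for a fixed $\theta$, with $h_1, h_2$ two generalized rules and $\lambda \in [0,1]$ (notation introduced here for the argument):

```latex
\begin{align*}
G_{\exp}(\lambda h_1 + (1-\lambda) h_2, \theta)
  &= E_\theta\Big[ \big(e^{-h_1(X^{(n)}) s(\theta)}\big)^{\lambda}
                   \big(e^{-h_2(X^{(n)}) s(\theta)}\big)^{1-\lambda} \Big] \\
  &\le G_{\exp}(h_1,\theta)^{\lambda}\, G_{\exp}(h_2,\theta)^{1-\lambda}
  \qquad \text{(H\"older with exponents } 1/\lambda,\ 1/(1-\lambda)\text{)} .
\end{align*}
```

Taking logarithms gives $\log G_{\exp}(\lambda h_1 + (1-\lambda)h_2, \theta) \le \lambda \log G_{\exp}(h_1,\theta) + (1-\lambda)\log G_{\exp}(h_2,\theta)$, and convexity is preserved by sums and by suprema over $\theta$.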

I believe that these explanations shed some additional light on the construction
proposed in Theorem 2.1 of the paper under discussion. This also raises several
questions that might be interesting to investigate in the future. In particular,
a compelling question is to characterize the set of surrogate loss functions
$\phi$ that lead to computationally tractable testing procedures and for which
the testing error rate remains small. Another question is the possibility to deal
with test (6) directly, without using the final step of convexification. At a
heuristic level, the risk of test (6) should be smaller than that of test (7).
Therefore, the advantage of the latter would be only computational tractability.
I wonder if it is possible to efficiently compute the test (6) despite the lack
of convex-concavity of the cost function, exploiting the facts that (a) for every
$h$, the sup of $\log G_{\exp}(h,\theta)$ over $\Theta$ can be efficiently
computed, and (b) for every $\theta$, the minimum of $\log G_{\exp}(h,\theta)$
over $\mathcal H_0$ can be efficiently computed as well.


Reduction to testing simple hypotheses 

The definition of the test given by the authors in Theorem 2.1, see also Eq. (7)
above, is well suited for computational purposes but, in my opinion, has
the inconvenience of hiding the main reason why the proposed test is a natural
one to use in the setting under consideration. In fact, the proposed test can be
alternatively defined as follows: in order to distinguish between two (convex)
hypotheses $\Theta_0$ and $\Theta_1$ based on a sample $X \sim P_{\theta^*}$,

1. Determine the two closest points $\theta_0^* \in \Theta_0$ and
$\theta_1^* \in \Theta_1$ in terms of the Hellinger distance between the
corresponding distributions (in other terms, find the two representers
$P_{\theta_0^*}$ and $P_{\theta_1^*}$ in the families
$\{P_\theta : \theta \in \Theta_0\}$ and $\{P_\theta : \theta \in \Theta_1\}$
that are the hardest to distinguish). This step is completely data independent.

2. Apply the standard likelihood-ratio test to the problem of choosing among
two simple hypotheses $H_0 : \theta = \theta_0^*$ versus $H_1 : \theta = \theta_1^*$.
The equivalence of these two definitions follows from the proof of Theorem 2.1,
see Eq. (52). In Section 2.3.2, this interpretation is presented for the discrete
observation scheme. At a conceptual level, it is important to underline that the
same interpretation holds true in the general case as well. However, from a
practical point of view, the definition given in the paper is more convenient
than the foregoing one since the first step of the latter, generally, is not
computationally tractable.
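
As an illustration of the two-step description, consider the Gaussian shift model $X \sim N(\theta, I_d)$ with two disjoint Euclidean balls as hypotheses. In that model the Hellinger affinity between $N(\theta_0, I_d)$ and $N(\theta_1, I_d)$ equals $\exp(-\|\theta_0-\theta_1\|_2^2/8)$, so the Hellinger-closest pair is the Euclidean-closest pair, which is available in closed form for balls. The choice of balls, the identity covariance, the single observation and all function names are assumptions of this sketch, chosen precisely because step 1 is tractable there.

```python
import math

def closest_points(c0, r0, c1, r1):
    """Step 1: closest pair between the balls B(c0, r0) and B(c1, r1)."""
    gap = [b - a for a, b in zip(c0, c1)]
    dist = math.sqrt(sum(g * g for g in gap))
    if dist <= r0 + r1:
        raise ValueError("the two hypotheses are not separated")
    unit = [g / dist for g in gap]
    theta0 = [a + r0 * u for a, u in zip(c0, unit)]
    theta1 = [b - r1 * u for b, u in zip(c1, unit)]
    return theta0, theta1

def likelihood_ratio_test(x, theta0, theta1):
    """Step 2: log-likelihood ratio of N(theta1, I) against N(theta0, I)."""
    stat = sum((t1 - t0) * (xi - 0.5 * (t0 + t1))
               for xi, t0, t1 in zip(x, theta0, theta1))
    return 1 if stat > 0 else -1  # +1 accepts Theta_1, -1 accepts Theta_0

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def worst_case_error(theta0, theta1):
    """1 - Phi(||theta0 - theta1||_2 / 2), i.e. Eq. (8) with Sigma = I."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta0, theta1)))
    return 1.0 - std_normal_cdf(0.5 * dist)

theta0, theta1 = closest_points((0.0, 0.0), 1.0, (4.0, 0.0), 1.0)
print(theta0, theta1, worst_case_error(theta0, theta1))
print(likelihood_ratio_test((0.5, 0.0), theta0, theta1))
```

The first function uses no data at all, which mirrors the remark that step 1 is completely data independent; only the second function touches the observation.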


Testing error for inexact solutions 


As judiciously noted by the authors, in many practical situations, the exact
computation of the saddle point in (7) cannot be performed. Then, one relies on
an approximation of the saddle point and it is a central task to assess how this
approximation error impacts the error of testing. I find it relevant to measure
the error of approximation in terms of the magnitude of violation of first-order
optimality conditions (see, for instance, Eq. (8) of the paper under discussion).
In such a context, the authors establish upper bounds on the error of the test
based on an approximate solution to the saddle point problem. For example,
in the case of the Gaussian observation scheme explored in Section 2.3.1, it is
shown that the worst-case error rate of the test based on the exact solution is

$$\varepsilon_* = 1 - \Phi\Big(\tfrac12 \|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2\Big), \tag{8}$$

where $\Phi$ is the cumulative distribution function of the standard normal
distribution and $(\theta_0^*, \theta_1^*)$ is the second argument of the solution
to the saddle point problem. On the other hand, when an inexact solution
$(\bar\theta_0, \bar\theta_1)$ is used, with an approximation error bounded by
$\delta > 0$, the worst-case error rate satisfies (see Eq. (9)):


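$$\varepsilon \le 1 - \Phi\Big(\tfrac12 \|\Sigma^{-1/2}(\bar\theta_0 - \bar\theta_1)\|_2 - \frac{\delta}{\|\Sigma^{-1/2}(\bar\theta_0 - \bar\theta_1)\|_2}\Big).$$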


In my opinion, it is worth complementing this upper bound by another one
that involves only the exact solution $(\theta_0^*, \theta_1^*)$ and, therefore,
makes it easier to compare the two errors $\varepsilon_*$ and $\varepsilon$. In
the case of the Gaussian observation scheme, this can be easily done. In fact,
one can deduce from the first-order exact and approximate optimality conditions
that

$$\|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2 - \sqrt{\delta} \le \|\Sigma^{-1/2}(\bar\theta_0 - \bar\theta_1)\|_2 \le \|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2 + \sqrt{\delta}. \tag{9}$$
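
One possible route to (9) is sketched here under two assumptions: the exact solution minimizes $\|\Sigma^{-1/2}(\theta_0 - \theta_1)\|_2$ over $\Theta_0 \times \Theta_1$, and the approximate optimality condition reads $(\bar\theta_1 - \bar\theta_0)^\top \Sigma^{-1}(\theta - \bar\theta_0) + (\bar\theta_0 - \bar\theta_1)^\top \Sigma^{-1}(\theta' - \bar\theta_1) \le \delta$ for all $(\theta, \theta') \in \Theta_0 \times \Theta_1$, with $(\bar\theta_0, \bar\theta_1)$ itself in $\Theta_0 \times \Theta_1$. The shorthands $a^* = \Sigma^{-1/2}(\theta_0^* - \theta_1^*)$ and $\bar a = \Sigma^{-1/2}(\bar\theta_0 - \bar\theta_1)$ are introduced only for this argument.

```latex
\begin{align*}
\langle a^*, \bar a - a^* \rangle &\ge 0
  && \text{(exact optimality, tested at } (\bar\theta_0, \bar\theta_1)\text{)}, \\
\|\bar a\|_2^2 - \langle \bar a, a^* \rangle &\le \delta
  && \text{(approximate optimality, tested at } (\theta_0^*, \theta_1^*)\text{)}, \\
\|\bar a - a^*\|_2^2
  &= \big( \|\bar a\|_2^2 - \langle \bar a, a^* \rangle \big)
     - \langle a^*, \bar a - a^* \rangle \le \delta .
\end{align*}
```

Hence $\|\bar a - a^*\|_2 \le \sqrt{\delta}$, and the triangle inequality yields both sides of (9).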


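Since the Gaussian cdf is increasing, and so is $t \mapsto t/2 - \delta/t$ on
$(0,\infty)$, we infer from this inequality that

$$\varepsilon \le 1 - \Phi\Big(\tfrac12 \|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2 - \frac{\sqrt{\delta}}{2} - \frac{\delta}{\|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2 - \sqrt{\delta}}\Big).$$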


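An even more elegant bound can be obtained if the normalized approximate
optimality condition is used: $\forall (\theta, \theta') \in \Theta_0 \times \Theta_1$,
it holds

$$(\bar\theta_1 - \bar\theta_0)^\top \Sigma^{-1} (\theta - \bar\theta_0) + (\bar\theta_0 - \bar\theta_1)^\top \Sigma^{-1} (\theta' - \bar\theta_1) \le \delta\, \|\Sigma^{-1/2}(\bar\theta_0 - \bar\theta_1)\|_2^2 .$$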


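In this case, inequalities (9) take the form

$$\frac{\|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2}{1 + \sqrt{\delta}} \le \|\Sigma^{-1/2}(\bar\theta_0 - \bar\theta_1)\|_2 \le \frac{\|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2}{1 - \sqrt{\delta}} \tag{10}$$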






Dalalyan, A.S./Discussion on “Hypotheses testing by convex optimization’ 




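and we get

$$\varepsilon \le 1 - \Phi\Big(\frac{1/2 - \delta}{1 + \sqrt{\delta}}\, \|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2\Big).$$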


This inequality allows for an easy comparison of $\varepsilon$ and
$\varepsilon_*$ in the case of Gaussian observations. In the case of other
observation schemes, deriving upper bounds of this type seems to be more
challenging and constitutes an interesting avenue of future research.
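
For a rough numerical feel of this comparison, the sketch below evaluates $\varepsilon_*$ from (8) next to an upper bound of the form $1 - \Phi\big(\tfrac{1/2-\delta}{1+\sqrt{\delta}}\, d_*\big)$, where $d_* = \|\Sigma^{-1/2}(\theta_0^* - \theta_1^*)\|_2$. That form of the bound is an assumption of this sketch (it is what one gets by combining the normalized optimality condition with the left inequality in (10)); the function names and the sample values of $d_*$ and $\delta$ are hypothetical.

```python
import math

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def exact_error(d_star):
    """eps_* = 1 - Phi(d_star / 2), with d_star the norm appearing in Eq. (8)."""
    return 1.0 - std_normal_cdf(0.5 * d_star)

def inexact_error_bound(d_star, delta):
    """Assumed bound 1 - Phi((1/2 - delta) / (1 + sqrt(delta)) * d_star)."""
    if not 0.0 <= delta < 0.5:
        raise ValueError("this sketch assumes 0 <= delta < 1/2")
    factor = (0.5 - delta) / (1.0 + math.sqrt(delta))
    return 1.0 - std_normal_cdf(factor * d_star)

d_star = 4.0
for delta in (0.0, 0.001, 0.01, 0.05):
    print(delta, exact_error(d_star), inexact_error_bound(d_star, delta))
```

With $\delta = 0$ the two quantities coincide, and the bound degrades monotonically as $\delta$ grows, which is the kind of direct comparison the inequality is meant to allow.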